
Add generation server scripts using HF accelerate and DS-inference #328

Merged
mayank31398 merged 33 commits into main from add-generation-server on Sep 1, 2022

Conversation

mayank31398
Collaborator

Currently, the scripts are working correctly.
However, there is some memory leak.
In huggingface/accelerate#614, @sgugger says that it's not in accelerate, which is probably true.

My hunch is that this is a related bug: apache/mxnet#19159

@mayank31398
Collaborator Author

This is still WIP
@stas00 stas00 left a comment
Member

Thank you for working on the refactoring and the server code, @mayank31398.

  1. Let's create libraries rather than having the direct scripts import from the server scripts, so that both types import from the same library. So probably the following structure:
bloom-accelerate-inference.py        # the shared library code
bloom-accelerate-inference-cli.py    # the original script, importing from bloom-accelerate-inference.py
bloom-accelerate-inference-server.py # the server script, importing from bloom-accelerate-inference.py

Same for the other sets, e.g. bloom-deepspeed-inference.py, bloom-deepspeed-zero-inference.py, ...

This is of course just a suggestion, I'm happy to hear other ideas.
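
For concreteness, a minimal sketch of the split being suggested, using underscore module names (Python cannot import hyphenated file names directly) and illustrative function names rather than the PR's actual API:

# bloom_accelerate_inference.py -- hypothetical shared library; both entry points import from it.
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_model(model_name: str = "bigscience/bloom"):
    # Load the tokenizer and model once; the CLI and the server both reuse this.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    return tokenizer, model


def generate(tokenizer, model, prompts, **generate_kwargs):
    # Shared tokenize -> generate -> detokenize helper.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    outputs = model.generate(**inputs, **generate_kwargs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# bloom_accelerate_inference_cli.py and bloom_accelerate_inference_server.py would then
# both start with:
#     from bloom_accelerate_inference import load_model, generate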

scripts/inference/README.md (review thread, resolved)
@mayank31398
Collaborator Author

@stas00, I believe the code is in good shape. You can start reviewing.
Also, I'll start updating the README with instructions tomorrow.

inference/cli.py (review thread, resolved)
@stas00
Member

stas00 commented Aug 17, 2022

Oh wow, you developed a whole package around this. Impressive work, @mayank31398

Let's just discuss the user-facing APIs before implementing, to minimize wasted time.

Originally, these were just scripts to benchmark and demo how the various solutions can be used, but now I'm not so sure how demo-friendly these are with so many separate files.

Some design questions:

  • Why does the CLI import from the server code?
  • Would it really make things simpler to have one CLI script for all the different systems? They are quite different after all, with very different arguments - e.g. DS-ZeRO will need an NVMe path as well as flags for CPU/NVMe offload. I just had those hardcoded - but there are 3 ways to invoke this one: gpu / cpu-offload / nvme-offload.
  • The while loop in cli.py isn't debug friendly - can we activate it only with an explicit flag and by default just use pre-made inputs - or perhaps 2 separate scripts? I'm not even sure who would ever want to manually type text - perhaps a flag to feed it a file with prompts instead? Not sure.
  • The multiple-input (BS > 1) functionality got lost in the transition.
  • The naming of the scripts lost their mnemonics - cli.py is very ambiguous, since it tells me nothing about what it does. Please see my suggestions at Add generation server scripts using HF accelerate and DS-inference #328 (review), but I suppose that the way it's going it's no longer BLOOM-specific either, and could probably work with any other model.

I'm also no longer sure what to do with this package. Originally the plan was to stash the original demo scripts under transformers' research files, but now it's growing towards an independent package. Not a bad thing, I'm just thinking aloud.

@mayank31398
Collaborator Author

mayank31398 commented Aug 17, 2022

  1. I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I am still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII lets me get away without doing a broadcast operation on the input myself. Also, for code modification (if required down the line), this might be simpler.
  2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking though.
  3. I think adding an --input_file flag to the CLI script might be useful. If not provided, we can launch in interactive mode (see the sketch after this list). I will try to make this debug friendly. Currently, it crashes.
  4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via the command line for now).
  5. I was having trouble with the naming convention (because of the hyphens in the names, I couldn't import them as packages). We can rename cli to interactive or something.
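
A rough sketch of how the --input_file / interactive switch could look (the flag and function names are hypothetical, not what this PR ships):

import argparse


def get_prompts():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_file", type=str, default=None,
                        help="file with one prompt per line; omit for interactive mode")
    args = parser.parse_args()

    if args.input_file is not None:
        # Non-interactive path: read all prompts up front (works out of the box, debug friendly).
        with open(args.input_file) as f:
            return [line.strip() for line in f if line.strip()]

    # Interactive fallback: keep reading prompts until the user enters an empty line.
    prompts = []
    while True:
        text = input("prompt (empty line to stop) > ")
        if not text:
            break
        prompts.append(text)
    return prompts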

Let me know your thoughts on this. Thanks for the review :)

@mayank31398
Collaborator Author

Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM.
I can open up another PR to support other models.
Adding so much functionality in a single PR might not be the best thing to do.

@stas00
Member

stas00 commented Aug 17, 2022

  1. I found it easier to deploy using DeepSpeed-MII and leverage that for the CLI. But I wasn't really sure of the overhead it causes, so I am still using the barebones DS-inference for benchmarking. Using DeepSpeed-MII lets me get away without doing a broadcast operation on the input myself. Also, for code modification (if required down the line), this might be simpler.

So you are not benchmarking the same thing, then.

You don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.

The intention of the CLI was what it did in the original code - tokenize + generate + detokenize - done.
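
i.e. roughly this, and nothing more (a simplified sketch, not the original script verbatim; the prompt and generation settings are just for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", torch_dtype="auto")

inputs = tokenizer(["DeepSpeed is a machine learning framework"], return_tensors="pt").to("cuda:0")  # tokenize
outputs = model.generate(**inputs, max_new_tokens=100)                                               # generate
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))                                     # detokenize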

Perhaps we have miscommunicated here.

2. I am not sure if we should add DS-ZeRO to the CLI. The latency for me (batch size = 1 and output tokens = 100) is 23 sec for accelerate, 7 sec for DS-inference and 230 sec for DS-ZeRO. So, it doesn't make sense to add that. It's nice to have it in benchmarking though.

I answered earlier in the code comment.

3. I think adding an --input_file flag to the CLI script might be useful. If not provided, we can launch in interactive mode. I will try to make this debug friendly. Currently, it crashes.

I meant the script should work out of the box and require no user input. I think we have a conflict of use cases here: I wanted an easy-to-understand-and-replicate demo, you want a standalone system.

Perhaps the right solution is to just keep the original scripts as they are for the demo/benchmarking purpose, and then have your package for production use. What do you think? There will be a bit of duplication, but as you're saying, you don't care for solutions that are not fast enough, which is fair for your needs.

What crashes?

4. batch_size > 1 still works, in both server mode and the benchmark script. Not in the CLI (since I am expecting the user to input via the command line for now).

5. I was having trouble with the naming convention (because of the hyphens in the names, I couldn't import them as packages). We can rename cli to interactive or something.

For end-user scripts there should be no need to import them.

Also, I am not sure if this works for other models (a few strings are hardcoded). I think for now we should stick to BLOOM.
I can open up another PR to support other models.
Adding so much functionality in a single PR might not be the best thing to do.

Agreed, let's stick to BLOOM for now. And that's why I suggested that the naming of the scripts should indicate that they are BLOOM-specific, which your version doesn't.

@mayank31398
Collaborator Author

mayank31398 commented Aug 17, 2022

You don't need to broadcast anything for non-server usage. It sounds like you built cli.py as a client to the server, but that's too complicated for someone who just wants to process a few inputs.

The CLI works this way: the user only needs to provide the input text. Also, I need to call the generate method on all 8 GPU processes (in non-server usage), and since the input text is only requested on 1 process, you need to broadcast it, as far as I understand.
Yes, the CLI for DS-inference is built as a client to the server. I can change it though, if we want.
In the original DS-inference script, @stas00, you didn't need to broadcast since each process was creating the list.
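
For reference, the broadcast I mean is essentially this (a sketch; it assumes torch.distributed is already initialized across the 8 processes):

import torch.distributed as dist


def get_input_text():
    # Only rank 0 actually prompts the user; the string is then broadcast to every rank
    # so that all processes can take part in the same generate() call.
    obj = [input("Input text: ")] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    return obj[0]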

I will incorporate your suggestions.

@mayank31398
Collaborator Author

So, just to summarize for cli.py:
You want the option for the user to provide an input text file?

@stas00
Member

stas00 commented Aug 24, 2022

I feel like an empty prompt is fine. This is basically unconditional generation, right?

Good question - I have never tried it - I guess it should work.

@mayank31398
Collaborator Author

Also, @stas00, I have been meaning to ask:
is accelerate using DeepSpeed ZeRO as its backend?
Because if that is the case, then why does the generation time per batch differ so greatly between the two?

@stas00
Member

stas00 commented Aug 24, 2022

Is accelerate using DeepSpeed ZeRO as its backend?

It can, but it's not using it in our scripts here. It uses naive pipeline parallelism, assigning different layers to different devices and switching between them.
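
Concretely, the layer-to-device assignment comes from device_map="auto" and can be inspected via hf_device_map (a sketch; the printed mapping is just an example):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", device_map="auto", torch_dtype="auto")
print(model.hf_device_map)
# e.g. {'transformer.word_embeddings': 0, 'transformer.h.0': 0, ..., 'transformer.h.69': 7, 'lm_head': 7}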

@mayank31398
Collaborator Author

@stas00, while working on this, I found a really interesting bug. Could be a severe one.
I have provided a minimal working example.

@mayank31398
Collaborator Author

from huggingface_hub import snapshot_download

  1. I am still using the old method (it still works for >=0.9.0) if the user inputs the model_name as bigscience/bloom in ds-inference. I am doing this because snapshot_download is not respecting TRANSFORMERS_CACHE and I don't want it to download duplicate checkpoints (for bigscience/bloom). However, this is slow (it re-shards the 72 files).
  2. For the newer weights provided by Microsoft, I am using snapshot_download. I am also adding support for int8 in this PR. This path is much faster.

I feel like this is the best approach for now. If we want to change this we can do that in a separate PR in the future.

@stas00

@stas00
Member

stas00 commented Aug 28, 2022

because snapshot_download is not respecting the TRANSFORMERS_CACHE

If you use HUGGINGFACE_HUB_CACHE instead of TRANSFORMERS_CACHE it would work for either.

Alternatively you can do:

HUGGINGFACE_HUB_CACHE = os.getenv("HUGGINGFACE_HUB_CACHE", None)
cache_dir = os.getenv("TRANSFORMERS_CACHE", HUGGINGFACE_HUB_CACHE)
snapshot_download(..., cache_dir=cache_dir)

If neither is set, None would lead to the default path.

You can see:

https://github.com/huggingface/transformers/blob/21f6f58721dd9154357576be6de54eefef1f1818/src/transformers/utils/hub.py#L92

@stas00
Member

stas00 commented Aug 28, 2022

Actually, the first line isn't needed, as it's already the default:
https://github.com/huggingface/huggingface_hub/blob/7aadc3243d9478a48e17f568d8d4950cafd11ca2/src/huggingface_hub/_snapshot_download.py#L103-L104

so just:

cache_dir = os.getenv("TRANSFORMERS_CACHE", None)
snapshot_download(..., cache_dir=cache_dir)

@mayank31398
Collaborator Author

DeepSpeed-MII not working with new Microsoft weights: microsoft/DeepSpeed-MII#50

@mayank31398
Collaborator Author

mayank31398 commented Aug 31, 2022

@stas00 https://huggingface.co/bigscience/bloom/tree/main
Some changes were pushed to bigscience/bloom just 3 hours ago.
snapshot_download fails as a result of this change.

  File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/ds_zero/model.py", line 17, in __init__
    return snapshot_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
    downloaded_model_path = get_downloaded_model_path(args.model_name)
  File "/net/llm-shared-nfs/nfs/mayank/BigScience-Megatron-DeepSpeed/scripts/bloom-inference-server/utils/model.py", line 105, in get_downloaded_model_path
    return f(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
    return snapshot_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 93, in inner_f
    _ = hf_hub_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
    return f(*args, **kwargs)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/_snapshot_download.py", line 192, in snapshot_download
    _ = hf_hub_download(
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1218, in hf_hub_download
    _create_relative_symlink(blob_path, pointer_path)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
    _create_relative_symlink(blob_path, pointer_path)
  File "/net/llm-shared-nfs/nfs/yelkurdi/conda/miniconda3/envs/llmpt/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 837, in _create_relative_symlink
    os.symlink(relative_src, dst)
FileExistsError: [Errno 17] File exists: '../../../../blobs/c4072f2f02baee6f0d50739cf14439f61e0605913ed65606f9bcb11043838a2e' -> '/home/gpttest/nfs/mayank/HF_cache/models--bigscience--bloom/snapshots/874fa447d586d0a28ab66aa3ca49da43ae1ec0f6/tensorboard/main/events.out.tfevents.1649441015.jean-zay-iam50.156467.0'

Any idea how to get around this?
Also, if this is fixed, would snapshot_download download all the checkpoints again?
That would be highly impractical, right?

Please let me know. I am not sure of a good solution.
Also, clearing the cache for every commit might not be practical.

@stas00
Member

stas00 commented Aug 31, 2022

It looks like only README.md was updated, so it's weird that it would fail.

I just ran:

python -c "from huggingface_hub import snapshot_download; snapshot_download('bigscience/bloom')"

and it updated w/o any issues.

Probably move the file it complains about out of the way for now? Could be a bug in huggingface_hub?

It shouldn't re-download anything but the new or updated files.

@stas00
Member

stas00 commented Aug 31, 2022

@mayank31398, please also note the renames in this commit https://github.com/microsoft/DeepSpeed/pull/2217/files/c77f5e0e385db405475b2842a6f5725c0f551170..f3f4b1dda9f9ba09042b8b63bbee475c9364bc56 in particular s/mp_size/tp_size/

This is now merged, so we should update our side - you can just do a global search and replace on all of the files under scripts - or if you'd prefer that I do it, please let me know - I don't want to step on your toes if you're still on the old DS branch.

@mayank31398
Collaborator Author

mayank31398 commented Sep 1, 2022

Let me try updating to the latest huggingface_hub and DeepSpeed to see if that fixes things.
Never mind, I am on the latest versions.
I agree this should work, @stas00.
It's working fine for GPT-2.
No idea what is going wrong for BLOOM though. For now, I'll just remove this file.

@mayank31398
Collaborator Author

Never mind, I have fixed this. It was a bug because all 8 processes in my DeepSpeed code were trying to download simultaneously.
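
For reference, the usual pattern for avoiding this kind of race (not necessarily the exact fix in this PR; it assumes torch.distributed is already initialized):

import torch.distributed as dist
from huggingface_hub import snapshot_download


def download_once(model_name: str) -> str:
    # Let rank 0 populate the cache first; the other ranks then hit the cache instead of the network.
    if dist.get_rank() == 0:
        path = snapshot_download(model_name)
    dist.barrier()
    if dist.get_rank() != 0:
        path = snapshot_download(model_name)
    return path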

@stas00
Member

stas00 commented Sep 1, 2022

Oddly enough, it doesn't look like mp_size was renamed to tp_size, so I only had to modify this line for the updated checkpoints:

checkpoints_json = os.path.join(repo_root, "ds_inference_config.json")

@stas00
Member

stas00 commented Sep 1, 2022

@mayank31398, would it be OK if we merged this PR - and then perhaps have another iteration of tweaks in subsequent PRs if needed? As we are renaming files I want the rename to be merged first before tweaking those.

You have the commit rights, so you can do the honours if you're in agreement.

Thank you!

@stas00 stas00 left a comment
Member

Let's merge this

@mayank31398
Collaborator Author

Yes @stas00, currently MII is not working with DS-inference. But we can merge and I can continue to work on this.

@mayank31398 mayank31398 merged commit f9402d0 into main Sep 1, 2022
@mayank31398 mayank31398 deleted the add-generation-server branch September 1, 2022 17:06
@mayank31398
Collaborator Author

Also, @stas00, there are some pieces of code that both the server folder and the scripts folder can share.
I will work on factoring the redundant code out once everything in the server is fixed :)

@stas00
Member

stas00 commented Sep 1, 2022

Perfect, then you can start a new PR once you have some new edits. We can just do it in iterations.

I need to edit the original scripts to support the new repos. I'll do that in another PR.

younesbelkada pushed a commit to younesbelkada/Megatron-DeepSpeed that referenced this pull request Sep 28, 2022
…igscience-workshop#328)

* first step towards making libs

* HF accelerate model

* refactor accelerate

* refactor DS inference

* refactor DS ZeRO

* make inference library

* cli

* server

* request

* remove MaxTokensError

* fix batch size error with DS inference server

* type fix

* add latency

* add latency

* add min_length to default kwargs

* str kwargs

* str kwargs

* fix comma

* add old scripts back

* move scripts

* drop data

* minor changes + add README

* update README

* drop nccl

* fix

* default values

* resolve issues

* handle keyboard interrupt

* remove caching

* use snapshot_download

* make server class

* fix snapshot download

Co-authored-by: Mayank Mishra <mayank31398@gmail.com>